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OFFICE OF PETITIONS 



BACKGROUND OF THE INVENTION 



1 . 



Field of the Invention 



This invention relates to computer architecture. 
In particular, this invention relates to the design of 
10 an instiruction unit in a superscalar processor. 

2. Discussion of the Related Art 

Parallelism is extensively exploited in modern 
computer designs. Among these designs are two distinct 
15 architectures which are known respectively as the very 
long instruction word (VLIW) architecture and the 
superscalar architecture. A superscalar processor is a 
computer which can dispatch one, two or more 
instructions simultaneously. Such a processor 

2 0 typically includes multiple functional units which can 

independently execute the dispatched instructions. In 
such a processor, a control logic circuit, which has 
come to be known as the "grouping logic" circuit, 
determines the instructions to dispatch (the 

25 "instruction group"), according to certain resource 

allocation and data dependency constraints. The task 
of the computer designer is to provide a grouping logic 
circuit which can dynamically evaluate such constraints 
to dispatch instruction groups which optimally use the 

30 available resources. A resource allocation constraint 
can be, for instance, in a computer with a single 
floating point multiplier unit, the constraint that no 
more than one floating point multiply instruction is to 
be dispatched for any given processor cycle. A 

3 5 processor cycle is the basic timing unit for a 

pipelined unit of the processor, typically the clock 
period of the CPU clock. An example of a data 
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dependency constraint is the avoidance of a "read- 
after-write" hazard. This constraint prevents 
dispatching an instruction which requires an operand 
from a register which is the destination of an write 
5 instruction dispatched earlier, but yet to be 
unre tired . 

A VLIW processor, unlike a superscalar processor, 
does not dynamically allocate system resources at run 
time. Rather, resource allocation and data dependency 

10 analysis are performed during program compilation. A 
VLIW processor decodes the long instruction word to 
provide the control information for operating the 
various independent functional units. The task of the 
compiler is to optimize performance of a program by 

15 generating a sequence of such instructions which, when 
decoded, efficiently exploit the program's inherent 
parallelism in the computer's parallel hardware. The 
hardware is given little control of instruction 
sequencing and dispatch. 

2 0 A VLIW computer, however, has a significant 

drawback in that its programs must be recompiled for 
each machine they run on. Such recompilation is 
required because the control information required by 
each machine is encoded in the instruction words. A 

25 superscalar computer, by contrast, is often designed to 
be able to run existing executable programs (i.e., 
"binaries"). In a superscalar computer, the 
instructions of an existing executable program are 
dispatched by the computer at run time according to the 

30 computer's particular resource availability and data 
integrity requirements. From a computer user's point 
of view, because existing binaries represent 
significant investments, the ability to acquire 
enhanced performance without the expense of purchasing 

3 5 new copies of binaries is a significant advantage. 

In the prior art, to determine the instructions 
that go into an instruction group of a given processor 

-2- 



L:VDMS\8242\M-3876_U\0171325.01 



cycle, a superscalar computer performs the resource 
allocation and data dependency checking tasks in the 
immediately preceding processor cycle. Under this 
scheme, the computer designer must ensure that such 
5 resource allocation and data dependency checking tasks 
complete within their processor cycle. As the number 
of the functional units that can be independently run 
increases, the time required for performing such 
resource allocation and data dependency checking tasks 
10 grows more rapidly than linearly. Consequently, in a 
superscalar computer design, the ability to perform 
resource and data integrity analysis within a single 
processor cycle can become a factor that limits the 
performance gain of additional parallelism. 

15 

SUMMARY OF THE INVENTION 

The present invention provides a central 
processing unit which includes a grouping logic circuit 
for determining simultaneously dispatchable 

20 instmctions in an processor cycle. The central 

processing unit of the present invention includes such 
a grouping logic circuit and a number of functional 
units, each adapted to execute one or more specified 
instructions dispatched by the grouping logic circuit. 

25 The grouping logic circuit includes a number of 

pipeline stages, such that resource allocation and data 
dependency checks can be performed over a number of 
processor cycles. The present invention therefore 
allows dispatching a large number of instruction 

3 0 simultaneously, while avoiding the complexity of the 
grouping logic circuit from becoming limiting the 
duration of the central processing unit's processor 
cycle . 

In one embodiment, the grouping logic circuit 
3 5 checks intra -group data dependency immediately upon 

receiving the instruction group. In that embodiment, 
all instruction in a group of instructions received in 
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a first processor cycle are dispatched prior to 
dispatching any instruction of a second group of 
instructions received at an processor cycle subsequent 
to said first processor cycle. 
5 The present invention is better understood upon 

consideration of the detailed description below in 
conjunction with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 
10 Figure 1 is a block diagram of a CPU 10 0, in an 

exemplary 4-way superscalar processor of the present 
invention . 

Figure 2 shows schematically a 4 -stage pipelined 
grouping logic circuit 109 in the 4-way superscalar 
15 processor of Figure 1. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

An embodiment of the present invention is 
illustrated by the block diagram of Figure 1, which 
2 0 shows a central processing unit (CPU) 100 in an 

exemplary 4-way superscalar processor of the present 
invention. A 4-way superscalar pro<iessor fetches, 
dispatches, executes and retires up to four 
instructions per processor cycle. As shown in Figure 

2 5 1, central processing unit 100 includes two arithmetic 

logic units 101 and 102, a load/store unit 103, which 
includes a 9 -deep load buffer 104 and an 8 -deep store 
buffer 105, a floating point adder 106, a floating 
point multiplier 107, and a floating point divider 108. 

3 0 In this embodiment, a grouping logic circuit 10 9 

dispatches up to four instructions per processor cycle. 
Completion unit 110 retires instructions upon 
completion. A register file (not shown) , including 
numerous integer and float point registers, is provided 
35 with sufficient number of ports to prevent contention 

among functional units for access to this register file 
during operand fetch or result write -back. In this 
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erabodiment also, loads are non-blocking, i.e., CPU 100 
continues to execute even though one or more dispatched 
load instructions have not complete. When the data of 
the load instructions are returned from the main 
5 memory, these data can be placed in a pipeline for 

storage in a second- level cache. In this embodiment, 
floating point adder 106 and floating point multiplier 
107 each have a 4 -stage pipeline. Similarly, 
load/store unit 103 has a 2 -stage pipeline. Floating 

10 point divider 108, which is not pipelined, requires 
more than one processor cycle per instruction. 

To simplify the discussion below, the state of CPU 
100 relevant to grouping logic 109 is summarized by a 
state variable S{t), which is defined below. Of 

15 course, the state of CPU 100 includes also other 

variables, such as those conventionally included in the 
processor status word. Those skilled in the art would 
appreciate the use and implementation of processor 
states. Thus, the state S(t) at time t of CPU 100 can 

20 be represented by: 

S(t) = {ALU^(t) , ALU^it) , LS{t) , LB{t) , SB{t), 
FA{t) , FMit) , FSD(t)} 

where ALU^^ (t) and ALU2 (t) are the states, at time t, 

of arithmetic logic units 101 and 102 
respectively; LS (t) and LB (t) are the 
25 states, at time t, of store buffer 105 and 

load buffer 104 respectively; FA{t), FM(t), 
and FDS(t) are the states, at time t, of 
floating point adder 106, floating point 
multiplier 107 and floating point divider 108 
3 0 respectively. 

At any given time, the state of each functional 
unit can be represented by the source and destination 
registers specified in the instructions dispatched to 
the functional unit but not yet retired. Thus, 
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ALU^ = {ALU^. rsl(t) , ALU^.rs2{t) , ALU^.zd(t)} 



where rsl(t), rs2(t) and rd(t) are respectively the 

first and second source registers, and the 
destination of registers of the instruction 
5 executing at time t in arithmetic logic unit 

101 . 

Similarly, the state of arithmetic logic unit 102 
can be defined as : 

ALU^ = {ALU^ . rsl ( t) , ALU^ . rs2 ( t) , ALU^ . rd ( t) } 



10 For pipelined functional units, such as floating 

point adder 106, the state is relatively more complex, 
consisting of the source and destination registers of 
the instructions in their respectively pipeline , Thus, 
for the pipelined units, i.e., load/store unit 103, 

15 load buffer 104, store buffer 105, floating point adder 
106, and floating point multiplier 107, their 
respective states, at time t, LS(t), LB(t), SB(t), 
FA(t) and FM(t) can be represented by: 

LS = {LS.rsl^it) , LS.rs2^{t) , LS.rd^it)} fori={l, 2} 



LB = {LB.rsl^it), LB.rs2^{t), LB.rd^(t)} 
for i={l,2, . . . ,9} 



20 



SB = iSB.rsl^it) , SB.rs2^{t) , SB.xd^it)} 
for i= {1,2, ... ,8} 



FA = {FA.rsl^it), FA.rs2^{t) , FA.rdj(t)} for i={l, . . . ,4} 



Finally, floating point divider 108 ' s state FSD{t) 
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FM = {FM.rsl^{t) , FM.rs2^{t) . FM.rd^(t)} fori={±,...,4} 

can be represented by: 

FDS = {FDS. rsl^ ( t) , FDS. IS2- ( t) , FDS. rd^ ( t) } 

State variable S(t) can be represented by a memory 
element, such as a register or a content addressable 
5 memory unit, at either a centralized location or in a 
distributed fashion. For example, in the distributed 
approach, the portion of state S{t) associated with a 
given functional unit can be implemented with the 
control logic of the functional unit. 

10 In the prior art, a grouping logic circuit would 

determine from the current state, S(t) at time t, the 
next state S{t+1), which includes information necessary 
to dispatch the instructions of the next processor 
cycle at time t+1. For example, to avoid a read-after- 

15 write hazard, such a grouping circuit would exclude 
from the next state S (t+1) an instruction having an 
operand to be fetched from a register designated for 
storing a result of a yet incomplete instruction. As 
another example, such a grouping circuit would include 

20 in state S(t+1) no more than one floating point "add" 
instruction in each processor cycle, since only one 
floating point adder (i.e. floating point adder 106) is 
available. As discussed above, as complexity 
increases, the time recpaired for propagating through 

25 the grouping logic circuit can become a critical path 
for the processor cycle. Thus, in accordance with the 
present invention, grouping logic circuit 109 is 
pipelined to derive, over t processor cycles, a future 
state S(t+T) based on the present state S(t) . The 

30 future state S(t+T) determines the instruction group to 
dispatch at time t+T. Pipelining grouping logic 109 is 
possible because, as demonstrated below, (i) the values 
of most state variables in the state S{t+T) can be 
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estimated from corresponding values of state S(t) with 
sufficient accuracy, and (ii) for those state variables 
for which values can not be accurately predicted, it is 
relatively straightforward to provide for all possible 
5 outcomes of state S(t+r) , or to use a conservative 

approach (i.e. not dispatching an instruction when such 
an instruction could have been dispatched) with a 
slight penalty on performance. 

The process for predicting state S(t+T) is 

10 explained next. The following discussion will first 
show that most components of next state S(t+1) can be 
precisely determined from present state S(t), and the 
remaining components of state S(t) can be reasonably 
determined, provided that certain non-deterministic 

15 conditions are appropriately handled. By induction, it 
can therefore be shown that future state SCt+r), where 
T is greater than 1, can likewise be determined from 
state S (t) . 

Since an instruction in floating point adder 106 
20 or floating point multiplier 107 completes after four 
processor cycles and an instruction in load/store unit 
103 completes after two processor cycles, the states 
FA, FM and LS at time t+1 can be derived from the 
corresponding state S(t) at time t, the immediately 
25 preceding processor cycle. In particular, the 

relationship governing the source and destination 
registers of each instruction executing in floating 
point adder 106, floating point multiplier 107 and 
load/store unit 103 between time t+1 and time t are: 

rsl^{t+l) = rsli_^{t) , for Ki^k 

30 

rs2^(t+l) = rs2^_^{t) , for l<i<.k 
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rd^{t+i) = rd^_^{t) , for Ki^k 

where k is the depth of the respective pipeline. 

The state FSD(t+l) of floating point divider 108, 
in which the time required to execute an instruction 
5 can exceed an processor cycle, is determined from state 
FSD(t) by: 

FSDit+1) = FSD(t) { if last stage] else null 

Whether or not floating point divider 108 is in its 
last stage can be determined from, for example, a 
10 hardware counter or a state register, which keep tracks 
of the number of processor cycles elapsed since the 
instruction in floating point divider 108 began 
execution. 

In load buffer 104 and store buffer 105, since the 
15 pending read or write operation at the head of each 

queue need not complete within one processor cycle, the 
state LB(t+l) at time t+1 cannot be determined from the 
immediately previous state LB(t) at 'time t with 
certainty. However, since state LB (t+1) can only 

2 0 either remain the same, or reflect the movement of the 

pipeline by one stage, two possible approaches to 
determine state LB (t+1) can be used. First, a 
conservative approach would predict LB (t+1) to be the 
same as LB(t) . Under this approach, when load buffer 
25 104 is full, an instruction is not dispatched until the 
pipeline in load buffer 106 advances. An incorrect 
prediction, i.e. a load instruction completes during 
the processor cycle of time t, this conservative 
approach leads to a penalty of one processor cycle, 

3 0 since a load instruction could have been dispatched at 

time t + 1. Alternatively, a more aggressive approach 
provides for both outcomes, i.e. load buffer 104 
advances one stage, and load buffer 104 remains the 
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same. Under this aggressive approach, grouping logic 
10 9 is ready to dispatch a load instruction, such 
dispatch to be enabled by a control signal which 
indicates, at time t+1, whether a load instruction has 
5 in fact completed. This aggressive approach requires 
more a complex logic circuit than the conseirvative 
approach . 

Thus, the skilled person would appreciate that 
state S(t+1) of CPU 100 can be predicted from state 

10 S(t) . Consequently, both the number of instructions 

and the types of instructions that can be dispatched at 
time t+1 (i.e. the instruction group at time t+1) based 
on predicted state S (t+l) can be derived, at time t, 
from state S(t), subject to additional handling based 

15 on the actual state S^{t+1) at time t+1. 

The above analysis can be can be extended to allow 
state S(t+T) at time t+r to be derived from state S(t) 
at time t. The instruction group at time t+r can be 
derived from time t, provided that, for each 

20 instruction group between time t and t+r, all 

instruction from that instruction group must be 
dispatched before any instruction from a subsequent 
instruction group is allowed to be dispatched (i.e. no 
instruction group merging) . 

2 5 Since instructions from different instruction 

groups are not merged, intra-group dependencies and 
inter-group dependencies can be checked in parallel. 
The instructions are either fetched from an instruction 
cache or an instruction buffer. An instruction buffer 

3 0 is preferable in a system in which not all accesses 

(e.g. branch instructions) to the instruction cache are 
aligned, and multiple entry points in the basic blocks 
of a program are allowed. 

Once four candidate instructions for an 
35 instruction group are identified, intra-group data 
dependency checking can begin. Because of the 
constraint against instruction group merging described 



-10- 



above, i.e., all instructions in an instruction group 
must be dispatched before an instruction from a 
subsequent instruction group can be dispatched, intra- 
group dependency checking can be accomplished in a 
5 pipelined fashion. That is, intra-group dependency 

checking can span more than one processor cycle and all 
inter-group dependency checking can occur independently 
of inter-group dependency checking. For the purpose of 
intra-group dependency check, each instruction group 
10 can be represented by: 

IntraSit) = {rsl^it) , rs2j^{t) , rd^{t) , res^Ct)} 

for 

where W is the width of the machine, and res^ 
represents the resource utilization of instruction I . 
An example of a four-stage pipeline 200 is shown in 
15 Figure 2. In Figure 2, at first stage 201, as soon as 
the instruction group is constituted, intra-group 
dependency checking is performed immediately. 
Thereafter, at stage 202, resource allocation within 
the instruction group can be determined. At stage 203, 
20 intergroup decisions, e.g. resource allocation 

decisions taking into consideration resource allocation 
in previous instruction groups, are merged with the 
decisions at stages 2 01 and 202. For example, if the 
present instruction group includes an instruction 
25 designated for floating point divider 108, stage 203 
would have determined at by this time if a previous 
instruction using floating point divider 108 would have 
completed by the time the present instruction group is 
due to be dispatched. Finally, at stage 2 04, non- 
30 deterministic conditions, e.g. the condition at store 
buffer 105, is considered. Dispatchable instructions 
are issued into CPU 100 at the end of stage 204. 

The above detailed description is provided to 



-11- 



L.uJMi\K242\M-3876 U\0171325.01 



illustrate the specific errbodiments of the present 
invention and is not intended to be limiting. Numerous 
variations and modifications within the scope of the 
present invention are possible. The present invention 
5 is defined by the following claims. 
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CLAIMS 

I claim: 

1. A central processing unit, comprising: 
a plurality of functional units, each 
functional unit adapted to execute an instruct ion, - 
and 

a grouping logic circuit, including a number 
of pipeline stages and receiving, at each 
processor cycle, a group of instructions, said 
grouping logic circuit dispatching each of said 
instructions to be executed by one of said 
fiinctional units . 



2. A central processing unit as in Claim 1, 
15 wherein said grouping logic circuit checks data 
dependency among said group of instructions to 
determine whether said group of instructions can be 
dispatched simultaneously. 

2 0 3. A central processing unit as in Claim 1, 

wherein said grouping logic circuit checks for resource 
contention within said group of instructions. 

4. A central processing unit as in Claim 1, 
25 wherein said grouping logic circuit checks data 

dependency of an instruction group at one processor 
cycle and a group of instruction received in a previous 
processor cycle. 

2 0 5. A central processing unit as in Claim 1, 

wherein the state of said central processing unit is 
represented in a register, said state including 
representation of destination registers of instructions 
in said group of instructions. 

35 

6. A central processing unit as in Claim 1, 
wherein all instruction in a group of instructions 
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received in a first processor cycle are dispatched 
prior to dispatching any instruction of a second group 
of instructions received at an processor cycle 
subsequent to said first processor cycle. 

5 

7. A central processing unit as in Claim 1, 
wherein said functional units include a pipelined 
functional unit capable of receiving an instruction 
every processor cycle and completing said instruction 
10 at a subsequent processor cycle. 



8. A central processing unit as in Claim 1, 
wherein said functional units include a functional unit 
requiring multiple processor cycles to complete an 

15 instruction executed at said functional unit. 

9. A central processing unit as in Claim 1, 
wherein said grouping logic circuit derives a state 
vector for a group of instiructions received at a first 

20 processor cycle based on a number of state vectors 

derived for groups of instructions received in a number 
of processor cycles immediately preceding said first 
processor cycle, said number of processor cycles being 
equal to said number of pipeline stages . 
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PIPELINED INSTRUCTION DISPATCH UNIT 
IN A SUPERSCALAR PROCESSOR 
Marc Tremblay 

ABSTRACT OF THE DISCLOSURE 

A pipelined instruction dispatch or grouping 
circuit allows instruction dispatch decisions to be 
made over multiple processor cycles. In one 
embodiment, the grouping circuit performs resource 
allocation and data dependency checks on an instruction 
group, based on a state vector which includes 
representation of source and destination registers of 
instructions within said instruction group and 
corresponding state vectors for instruction groups of a 
number of preceding processor cycles . 
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